Streaming k-means approximation

نویسندگان

  • Nir Ailon
  • Ragesh Jaiswal
  • Claire Monteleoni
چکیده

We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very light-weight in terms of memory, and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Constant Approximation for Streaming k-means

This article gives a constant factor approximation algorithm for streaming k-means that usesO(k log n) space.

متن کامل

Streaming k-means on Well-Clusterable Data

One of the central problems in data-analysis is k-means clustering. In recent years, considerable attention in the literature addressed the streaming variant of this problem, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, and Sohler; Chen) that produced a (1 + ε)approximation for k-means clustering in the streaming setting. Unfortunately,...

متن کامل

Achieving Approximate Soft Clustering in Data Streams

In recent years, data streaming has gained prominence due to advances in technologies that enable many applications to generate continuous flows of data. This increases the need to develop algorithms that are able to efficiently process data streams. Additionally, real-time requirements and evolving nature of data streams make stream mining problems, including clustering, challenging research p...

متن کامل

Streaming Algorithms for k-Means Clustering with Fast Queries

We present methods for k-means clustering on a stream with a focus on providing fast responses to clustering queries. When compared with the current state-of-the-art, our methods provide a substantial improvement in the time to answer a query for cluster centers, while retaining the desirable properties of provably small approximation error, and low space usage. Our algorithms are based on a no...

متن کامل

Efficient Clustering Algorithms for Out of Core, Distributed and Streaming Data

Clustering has been one of the most widely studied topics in data mining and it is often the first step of data mining process. Today, we are witnessing enormous growth in data volume. Often, data is distributed or it can be in the form of streaming data. Efficient clustering in all these scenario becomes a very challenging problem. Our work is in the context of k-means clustering algorithm. k-...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009